Photo classification using optical parameters of a camera from EXIF metadata

ABSTRACT

A method of classifying and organizing digital images utilizing optical metadata (captured using multiple sensors on the camera) may define semantically coherent image classes or annotations. The method defines optical parameters, based on the physics of vision and the operation of a camera, to cluster related images for future search and retrieval. An image database constructed from photos taken by at least thirty different users over a six-year period on four different continents was tested using algorithms to construct a hierarchical clustering model that clusters related images. Additionally, a survey about the image classes most frequently shot by ordinary users forms a baseline model for automatic annotation of images for search and retrieval by query keyword.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application having Ser. No. 60/914,544, filed on Apr. 27, 2007, which is hereby incorporated by reference herein in its entirety.

BACKGROUND OF THE INVENTION

The present invention relates generally to organizing digital files and, in particular, to organizing digital photos using optical parameters stored in camera metadata.

Much of the research in content based image classification and retrieval has focused on using two types of information sources: the pixel layer of an image and text present along with an image. It is known in the art of image retrieval systems to use various measures on image features such as color, texture, or shape. Other methods search images for local features such as edges, salient points, or objects. Algorithms have also been proposed which find scale and rotation invariant distinctive feature points in an image. These systems help in image matching and query by example. But image search using an example or low level features might be difficult and non-intuitive to some people. Rather, image search using keywords has become more popular.

The prior art has used mapping on low level image features to semantically classify coherent image classes. It is known to use an algorithm on color and texture features to classify indoor-outdoor images. One publication, “Content Based Hierarchical Classification of Vacation Images,” in Proc. IEEE Multimedia Computing and Systems, June 1999, pp. 518-523, by Vailaya et al., discloses use of a hierarchical structure to classify images into indoor-outdoor classes, and then outdoor images into city and landscape. Other applications, such as image search engines, rely on text, tags, or annotations to retrieve images.

Prior art research has also used the annotations/tags or text accompanying an image to derive human meta-information. As disclosed in “Integration of visual and text-based approaches for the content labeling and classification of photographs” by Paek et al., image features and text labels are then combined to classify photographs.

In some of the prior art, human agents are used to tag some images using predefined tags. An algorithm then predicts tags on untagged images. This approach suffers from the fact that it is nontrivial to define particular image classes, especially for large heterogeneous image databases. Some may find that tagging an image into a particular class depends on the user's perception of that image.

Other prior art approaches have used the Optical Meta layer to classify and annotate images. Some use this layer to help improve classification using the pixel layer, such as by using pixel values and optical metadata for sunset scene and indoor-outdoor classification. Such approaches may choose the most significant cue using K-L divergence analysis. Others use color, texture, and camera metadata in a hierarchical way to classify indoor and outdoor images. But indoor-outdoor classes are considered by some too broad to actually help in any annotation or retrieval. Also, these approaches lack any strong reference to the physics of vision (of why the images were being classified using the chosen cues). Further, the training sets used in the research have been artificially created for a specific purpose only.

As can be seen, there is a need for an improved method of classifying digital images for organization and retrieval that exploits inherent optical parameters for intuitive grouping by extracting similar optical features. Furthermore, it can be seen that a need exists for a method that automatically annotates digital images based on similar optical features.

SUMMARY OF THE INVENTION

In one aspect of the present invention, a method for classifying digital images comprises clustering optical parameters of the digital images into a set of meaningful clusters, associating the set of meaningful clusters to a set of associated classes used by a user, and classifying the digital images according to the set of associated classes.

In another aspect of the present invention, a method for organizing digital images comprises deriving optical parameters from the digital images, accessing a set of subject classes commonly used by a user and assembled into predefined parameters, determining what the user was trying to capture in the digital image by associating the derived optical parameters with the set of digital image subject classes, and organizing the digital images into classifications according to the associations determined by the derived optical parameters.

In yet another aspect of the present invention, a method for using optical metadata of a digital image to classify the digital image comprises analyzing the optical metadata to find clusters of digital images having similar optical concepts, comparing the clusters with human induced classes, and corresponding the human induced classes with a classification for the digital image.

These and other features, aspects and advantages of the present invention will become better understood with reference to the following drawings, description and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing a layered digital image structure according to the present invention;

FIG. 2 is an illustration of successive F-numbers and related aperture sizes in a camera;

FIG. 3 is an illustration depicting focal length in relation to field of view;

FIG. 4 is a series of graphs showing the distributions of: a: Exposure Time, b: Log Exposure Time, c: Focal Length, d: F-Number, e: Flash, f: Metering Mode;

FIG. 5 is a graph showing ISO speed rating and exposure time;

FIG. 6 is a series of graphs showing the distribution of Log Light Metric for images without flash (a) and with flash (b);

FIG. 7 is a series of graphs showing the Gaussian Mixture Models for photos without (a) and with flash (b);

FIG. 8 is a series of graphs showing the BIC criterion values;

FIG. 9 is a series of graphs showing the Gaussian mixtures on aperture diameter and focal length;

FIG. 10 shows two scatter plots of aperture diameter vs. focal length;

FIG. 11 shows two scatter plots of aperture diameter vs. focal length;

FIG. 12 shows a map of human classes to optical clusters;

FIG. 13 shows a series of models depicting Bayes Net modeling of images;

FIG. 14 is a set of annotated images with manual and predicted tags;

FIG. 15 is a set of annotated images with manual and predicted tags;

FIG. 16 shows a series of graphs describing the distribution of human induced classes over optical clusters;

FIG. 17 schematically represents a series of steps involved in a method for classifying digital photo images, according to another embodiment of the invention; and

FIG. 18 is a table depicting predicted annotations for untagged images.

DETAILED DESCRIPTION OF THE INVENTION

The following detailed description is of the best currently contemplated modes of carrying out the invention. The description is not to be taken in a limiting sense, but is made merely for the purpose of illustrating the general principles of the invention, since the scope of the invention is best defined by the appended claims.

The traditional camera records information (coming through the incident light) on film using chemical processes. The digital camera has CCD/CMOS image sensors which capture visual signals and store them in electronic memory. But apart from the visual signal, the digital camera stores other context information too. Hence, referring to FIG. 1, an exemplary multilayered structure of a digital photograph 101 may be proposed. The layers are: i. Pixel/Spectral layer 110 and ii. Meta Layer 120. The pixel layer 110 may contain the information recorded by the CCD as pixel values. The Meta Data Layer may have the following sub layers: a. Optical Meta Layer 130; b. Temporal Meta Layer 140; c. Spatial Meta Layer 150; d. Human Induced Meta Layer 160; and e. Derived Meta Layer 170. The optical meta layer 130 may contain the metadata related to the optics of the camera, e.g., the focal length, aperture, exposure time, etc. These metadata contain important cues about the context in which the image was shot (like the lighting condition, depth of field and distance of subjects in the image). The temporal meta layer 140 may contain the time stamp of the instant at which the photo was shot. The time stamp of a single image in a standalone environment may not be informative enough, but in a collection of images (e.g., photo albums) the time differences can shed valuable light on the content of the images. The spatial meta layer 150 may contain the spatial coordinates of the places where pictures were shot. These coordinates are generated by GPS systems attached to the camera; today, some off the shelf cameras do not have GPS support. The human induced meta layer 160 contains the tags/comments/ratings posted by people. Community tagging (in online photo albums) also helps to generate data for this layer. The Derived Meta Layer 170 metadata can be inferred from other information by learning algorithms, e.g., automatic annotation. The taxonomy defined above helps us to define the sources of information present in a digital camera image. Presently, the spectral, optical and temporal layers are present in almost all digital photographs, while the spatial, human induced and derived meta layers may or may not be present.

Most off the shelf digital cameras (both point and shoot and SLR (single lens reflex)) have built-in electronics to determine the outside illumination, subject distances, etc. Cameras in auto mode or different preset modes (Portrait/Night, etc.) use these electronics to tune the camera optics and store the information in the optical meta layer 130. The present invention uses the information stored in the optical meta layer to infer the content of an image. The present invention uses unsupervised learning algorithms to find clusters of images having similar ‘optical concepts’. The clusters are then compared with human induced semantic classes to show how they correspond.

The present invention may have, as representative examples, many advantages. For example, the optical meta information is used to infer the semantic content of an image. A probabilistic model is used to express the inference. Since the information in the optical meta layer is stored as real valued numbers (obtained from sensors), it can be retrieved and processed fast. Thus an algorithm can be used in the top level of any hierarchical image classification/automatic annotation system. Additionally, rather than using optical metadata as independent tags, novel metrics may be defined dependent on multiple optical meta tags, as explained by the physics of vision and camera. Unlike related image classification research, the method is not limited to a small set of classes (like indoor-outdoor or city-landscape). A survey was used to determine the most common classes amateurs like to shoot using off the shelf digital cameras. These human induced classes were used as semantic image concepts. Their correspondence may be shown with the clusters defined by the Optical Meta layer 130. The image data set derived from the survey consists of personal photos from at least thirty different amateur users. They were shot in a completely unconstrained environment, through a time span of six years and on four different continents. Hence the data set may be considered highly heterogeneous.

Background Study on Camera Parameters

The Exchangeable Image File (EXIF) Standard specifies the camera parameters to be recorded for a photo shoot. The actual parameters recorded depend on the particular camera manufacturer, but there are certain fundamental parameters which are recorded by all popular camera models. These are exposure time, focal length, f-number, flash, metering mode and ISO. An examination of seven different camera models (Canon, Casio, Sony, Nikon, Fuji, Kodak, and Konica) showed some commonalities: all of them have these parameters in the EXIF header. In the image database of 30,000 photos, over 90% of the images have these attributes. Subject distance may be an important optical parameter which has been used to infer the image content; however, it was present in less than ten percent of the images. These parameters help capture the information related to the intent of the photographer. This intent can be inferred using parameter values and sophisticated analysis and thus may be very effective in classification and annotation of images. In the following, descriptions of the parameters used for coding the intent are discussed, and in subsequent sections, approaches to decode the intent from the parameters are developed.

Exposure Time/Shutter Speed

The time interval for which the shutter of a camera is kept open to allow external light into the camera is known as exposure time. It is measured in seconds. Exposure time is directly proportional to the amount of light incident on the image plane. It also controls the motion blur in an image.

Aperture/F-Number

Referring to FIG. 2, aperture relates to the size of the opening through which light enters a camera. This size is controlled by a diaphragm over the opening which can close in or open up. The aperture size affects the amount of light on the image plane and the depth of field. Rather than using absolute aperture diameters, photographers use relative aperture, expressed as f-numbers or f-stops. Mathematically,

F-Number = Focal Length / Diameter of the aperture

Generally, point and shoot cameras have some discrete values of the F-stops. Each successive f-stop halves or doubles the amount of light entering the camera.
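As a worked example of the relation above (the numbers are illustrative and are not from the specification), solving for the diameter gives:

d = f/N; e.g., f = 50 mm and N = 2.0 give d = 25 mm, while N = 2.8 gives d ≈ 17.9 mm, roughly half the aperture area.

Since the aperture area scales with the square of the diameter, stepping the f-number up by a factor of √2 halves the light admitted, which is why f-stop scales run 1.4, 2, 2.8, 4, and so on.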

Focal Length

With reference to FIG. 3, a focal plane is the plane where subjects at infinity are focused (by a lens). The distance between the optical center of the lens and the focal plane is known as the focal length. The field of view of the camera is determined by the focal length and the size of the camera sensor. Short focal length lenses have wide fields of view (wide angle) and long focal length lenses have narrow fields of view (telephoto). Generally, point and shoot digital cameras have short focal lengths and small image sensors; hence, they generally produce wide angle images. FIG. 3 depicts a general relationship between the focal length of camera lenses and their effects on field of view.
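This dependence can be made concrete with the standard thin-lens approximation (a textbook relation, not recited in the specification), where s is the sensor width:

FOV = 2 × arctan(s / (2 × f))

For example, an assumed 6.2 mm wide sensor behind a 5.8 mm lens yields a field of view of roughly 56 degrees (wide angle), while the same sensor behind a 50 mm lens yields roughly 7 degrees (telephoto).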

Flash

Flash is the artificial light source within a camera. Other than using the flash for a shot in dark conditions, it can also be used as ‘fill-in flash’ in bright outdoor shots. This helps to make the shadowy areas less dark and decreases contrast in lighting. Flash status (fired/not-fired) may be stored in the metadata.

Film Speed/ISO

ISO speed ratings indicate the level of sensitivity of the image sensor (CCD) towards light. In traditional film cameras, ISO sensitivity is associated with a film stock. Lower ISO films are relatively finer in grain but require more light (e.g., outdoor day shots). High ISO films (more sensitive) are required for low light or action photography, but can produce grainy images. In digital cameras, ISO speed can be changed depending on circumstances without any media change. Hence the ISO speed rating of a digital camera may not be much related to the ISO rating of a film camera.

Distribution of Camera Parameters

A database of thirty thousand digital photographs was created from images taken from at least thirty different users. Images were also gathered from the MIT LabelMe project and the SIMPLIcity project. Most of the images were shot using point and shoot cameras and in a completely unconstrained environment. Spatially, the images are from at least four different continents (North and South America, Europe, Asia), and temporally they span a period of six years. One goal of using such a database is to find the distribution of optical metadata in amateur photographs for help in classification and retrieval of digital images. Due to the heterogeneity of the dataset, it can be taken to model an online community photo album.

FIGS. 4(a)-4(f) show the distribution of the various parameters in the image database. Referring to FIG. 4(a), the distribution of exposure time (in sec) may be considered highly skewed; less than one percent of the images have an exposure time of more than 0.5 second. The log-exposure time distribution is shown in FIG. 4(b). The distribution of focal length in millimeters is shown in FIG. 4(c). Since most of the images were shot by regular point and shoot cameras (which typically are wide angle with smaller relative aperture), the majority of images have a focal length in the range 0-100 mm. About 1-2% of the images have a focal length of more than 100 mm. The distribution of F-Numbers is also skewed towards the lower end of the spectrum, as seen in FIG. 4(d). Referring to FIG. 4(e), flash is a hex byte and its distribution shows various states of the flash, detection of reflected light and the red eye detection mechanism. Most of the images have a metering mode of five (multi zone). A small percentage has values 2 (spot) and 3 (center weighted average). Amateurs typically shoot photos in auto mode. The camera makes decisions on the optical parameters based on the feedback from other sensors (like light meters). One aspect of the present invention may invert this process and infer image content based on the optical parameters.

Visual Cues Extracted from Optical Parameters

Amount of Incident Light

Some in the art may interpret the distributions of the optical parameters to indicate that none of the parameters have sufficient discriminative power for meaningful classification when considered independently. However, the joint distribution of the parameters may be examined for important visual cues. One important cue which is hidden in the Optical Meta layer is the amount of ambient light when a photo was shot. Exposure time and aperture size may provide a strong hint about the amount of light. The camera's response to the incident light depends on the ISO speed rating. In traditional film cameras, a particular film stock had a predefined ISO speed rating; hence other optical parameters may be changed to take different shots with a particular film. But in digital cameras, the ISO can be changed independently from other parameters.

Referring to FIG. 5, a chart shows that ISO speed rating and exposure time are uncorrelated. To estimate the ambient lighting condition, a metric may be defined based on the premise that the amount of light entering a camera is directly proportional to:

the exposure time (ET), the area of the aperture opening (AP-Area), and the ISO speed rating of the sensor (ISO). Thus this measure can be expressed as:

Light Metric = ET × AP-Area × ISO,

where the constant of proportionality may be taken as 1, and the log of this value is called the Log Light Metric. Note that the Log Light Metric will have a small value when the ambient light is high (the camera will have a low exposure time, small aperture and low ISO). Similarly, it will have a large value if the ambient light is low. Also, the camera itself can create some artificial light using the flash. Hence one may study the distribution of this light metric separately on photographs with flash and without flash, as shown in FIGS. 6(a) and 6(b).
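A minimal sketch of this computation in Python follows. The function name and the sample values are illustrative assumptions; the aperture area is recovered from the EXIF f-number and focal length using the F-Number relation given earlier.

import math

def log_light_metric(exposure_time_s, f_number, focal_length_mm, iso):
    # Light Metric = ET x AP-Area x ISO, proportionality constant taken as 1.
    # The aperture diameter follows from F-Number = focal length / diameter.
    diameter_mm = focal_length_mm / f_number
    ap_area_mm2 = math.pi * (diameter_mm / 2.0) ** 2
    return math.log(exposure_time_s * ap_area_mm2 * iso)

# Bright daylight shot: short exposure, small aperture, low ISO -> small value.
print(log_light_metric(1 / 500, 8.0, 10.0, 100))
# Dim indoor shot: long exposure, wide aperture, high ISO -> large value.
print(log_light_metric(1 / 15, 2.8, 10.0, 400))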

Depth of Field (DOF)

The lens generally focuses at a particular distance (a plane) in front of the camera. All objects on the focusing plane are sharp, and theoretically, objects not on the focusing plane are blurred. However, due to the constraints of the human eye, some areas in front of and behind the focused subject appear acceptably sharp. This area is known as the depth of field. The depth of field depends on the aperture size (diameter), the subject distance, the target size and the focal length. Decreasing the aperture diameter increases the depth of field and vice versa. Some believe that, if the target size on the image plane remains constant, then DOF is independent of focal length. But to keep the target size constant over a range of focal lengths, the subject distance needs to change; hence the photographer has to move a lot. In normal daily shots, however, amateurs hardly care about maintaining a fixed target size on the image plane. Hence a longer focal length usually indicates a shallow depth of field, as it flattens perspective (especially in outdoor shots). The aperture diameter can be decreased to increase the depth of field, but decreasing the aperture also limits the amount of light entering the camera. To make sure a considerable amount of light reaches the image plane after diffraction, the aperture opening should not be made arbitrarily small. A small target size (e.g., flowers, insects, decorative objects) will lead to lower DOF, as the goal of such images is to separate the target out from the background.
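For reference, a standard textbook approximation (not recited in the specification) makes these dependencies explicit for subject distances well below the hyperfocal distance:

DOF ≈ 2 × u² × N × c / f²,

where u is the subject distance, N the f-number, c the circle of confusion, and f the focal length. Increasing N (a smaller opening) widens the DOF, while increasing f at a fixed subject distance narrows it sharply, consistent with the behavior described above.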

Unsupervised Clustering

Unlike the prior art, the present invention derives optical parameters from digital images to model the human concept of image content. The survey indicates that classes of images in personal photo albums may be highly diverse and overlapping. It may be very difficult to come up with a predefined set of class names which are mutually exclusive and exhaustive. Further, associating an image with a particular class depends on the user's perception. For example, the shot of a baby may be assigned class names such as: baby, portrait, person; the shot of people in restaurants can be assigned class names: restaurants, parties, people at dinner, etc. Thus a single image may be assigned to multiple classes by the same or different persons. Also, the knowledge of a particular incident may generate multiple class names like birthday party, marriage anniversary or family get-together. Hence, without tagging images into a set of class names, unsupervised learning techniques may be used to find clusters of images with similar ‘optical concepts’. To see how these optical concepts map to the human concepts of subject image classes commonly used, surveys were performed about the types of images amateurs generally shoot. Then an examination of how these human defined classes correlate with the unsupervised clusters was performed.

The Clustering Model

A hierarchical clustering method was chosen to find similar ‘optical concepts’ in the image database. At each level, the most important exemplary visual cue was chosen. Then, the distribution of the visual cue was modeled as a mixture of Gaussians. This has two advantages: due to the hierarchical structure, one can infer which visual cue is affecting which clusters, and one can give a hypothesis on the content of each cluster. Without prior knowledge of the distribution of the optical parameters, a Bayesian Model Selection may be used to find the optimum model and the Expectation Maximization (EM) algorithm to fit the model. When EM is used to find the maximum likelihood, the Bayesian model selection can be approximated by the Bayesian Information Criterion (BIC):

BIC = LL − NumParam/2 × log(N), where

i. LL = log likelihood of the data for the model = log[Prob(data|Model)]; ii. NumParam = number of independent parameters in the model; and iii. N = number of data points. For a Gaussian mixture model, the parameters are the means and covariances. Hence, NumParam = K×(1+Dim×Dim), where K is the number of components in the mixture and Dim is the dimensionality of the variables.
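A minimal sketch of this criterion, using the parameter count given above (fit_gmm in the usage comment is a hypothetical helper that would return the maximized log likelihood of a K-component mixture):

import math

def bic(log_likelihood, k_components, dim, n_points):
    # BIC = LL - NumParam/2 * log(N), with NumParam = K * (1 + Dim*Dim).
    num_param = k_components * (1 + dim * dim)
    return log_likelihood - num_param / 2.0 * math.log(n_points)

# The mixture size maximizing BIC is kept, e.g.:
# best_k = max(range(1, 10), key=lambda k: bic(fit_gmm(data, k), k, 1, len(data)))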

Hierarchical Clustering

Light is one important visual cue in any photo. The light content in the images may be modeled in the first level of the hierarchy. Flash is an external light source which influences the ambient light during a photo shoot. Thus, in one exemplary method of the present invention, photos may be separated with and without flash in the first level of the hierarchy. In the second level, images may be clustered based on the ambient lighting condition. The LogLightMetric may be used to estimate the ambient light. Then, the hierarchical clustering algorithm may be implemented separately on these two sets.

Clustering based on Amount of Light. The LogLightMetric was modeled using a mixture of Gaussians. The number of Gaussians was chosen based on the BIC value. Referring to FIGS. 6(a) and 6(b), the histograms of the LogLightMetric on images shot without and with flash, respectively, are shown. FIGS. 7(a) and 7(b) show the Gaussian clusters learned on the LogLightMetric distributions. FIGS. 8(a) and 8(b) show the BIC values for a varying number of clusters.

Clustering based on DOF. In the next level, the depth of field may be modeled for its distribution in the images. DOF depends on the aperture size, focal length and subject distance. But only some selected camera models (two out of the seven examined) have subject distance as a separate parameter in the Optical Meta layer. Generally, indoor shots have lower subject distances than outdoor shots. Further, the amount of light in indoor and outdoor images varies widely. Employing known classification techniques using light only, one may then be able to cluster images as either outdoor or indoor. Given this exemplary initial classification, the image content may be estimated using aperture diameter and focal length. In one instance, given the parameters of an indoor photo, a short focal length, and low DOF, an image may be of particular objects, portraits, etc., while one with a longer focal length and shallow DOF could be of a smaller target, e.g., food, indoor decorations, babies, etc. Since focal length and aperture diameter are two independent parameters (the former related to the lens and the latter related to the aperture opening), they were modeled independently as mixtures of Gaussians. Thus, for a particular log-light cluster L, there are F Gaussians for focal length and D Gaussians for diameter. Hence, in the second level of the hierarchy, cluster L may be further subdivided into F×D clusters. In each of the cases, the number of clusters is decided by the optimal BIC value. FIG. 9 shows the diameter and focal length Gaussians on some of the selected first level clusters, both from the Flash Fired and Flash Not Fired sets.
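The two-level procedure may be sketched as follows, assuming scikit-learn's GaussianMixture (whose bic() method returns a penalized negative log likelihood, so it is minimized rather than maximized) and hypothetical per-image arrays for flash state, LogLightMetric, focal length and aperture diameter:

import numpy as np
from sklearn.mixture import GaussianMixture

def best_gmm(x, max_k=8):
    # Fit 1-D mixtures for K = 1..max_k; keep the model selected by BIC.
    x = x.reshape(-1, 1)
    return min((GaussianMixture(k).fit(x) for k in range(1, max_k + 1)),
               key=lambda g: g.bic(x))

def cluster_hierarchy(photos):
    # photos: dict of equal-length numpy arrays keyed by 'flash',
    # 'log_light', 'focal_len', 'diameter' (field names are assumptions).
    clusters = {}
    for flash_state in (0, 1):                            # level 0: flash split
        sel = photos['flash'] == flash_state
        light_gmm = best_gmm(photos['log_light'][sel])    # level 1: ambient light
        light_lbl = light_gmm.predict(photos['log_light'][sel].reshape(-1, 1))
        for lc in range(light_gmm.n_components):          # level 2: F x D split
            in_lc = sel.copy()
            in_lc[sel] = (light_lbl == lc)
            f_gmm = best_gmm(photos['focal_len'][in_lc])  # focal length mixture
            d_gmm = best_gmm(photos['diameter'][in_lc])   # aperture diameter mixture
            clusters[(flash_state, lc)] = (f_gmm, d_gmm)
    return clusters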

Interpretation of Unsupervised Classes

The thirty thousand image dataset was then divided into two parts of twenty thousand and ten thousand images (randomly chosen). The unsupervised clustering process was used on the twenty thousand image set. The rest was kept aside for assigning class names and comparison with human induced classes. Some observations on the unsupervised clustering are as follows.

a. As discussed earlier, since flash is a source of light which modifies the ambient light condition, images were separated out with and without flash. The hierarchical clustering algorithm may then be implemented separately on these two sets.

b. The exposure time and focal length distributions are highly skewed (FIGS. 4(a) and 4(c)). Less than 1% of the images have exposure times greater than 0.5 second. These images may then be filtered out and clustered separately; they generally turn out to be images of night illuminations, fireworks, etc. Due to the choice of the light metric, the clusters shown in FIGS. 9(a) and 9(b) represent images with differing amounts of ambient light. FIG. 9(a) represents images shot in bright light, for example, daylight, while FIG. 9(b) represents images shot in low light conditions.

c. Most of the images in the dataset have been created by regular point and shoot digital cameras, which typically have wide angle lenses. FIG. 4(c) shows that photos with a focal length greater than 100 mm are highly sparse (except that there is a peak at 300 mm, which is from a stream of photos of a sporting event). Photos with high focal length were separated out and modeled independently. FIG. 10(b) shows the distribution of the diameter and focal lengths for this set. It may interest some to note that they also have a very high aperture diameter (>30 mm). This is because images with telephoto lenses have very shallow DOF; hence the aperture diameter is high.

d. In the second level of the hierarchy, clustering was done based on the diameter and focal length distribution. Clusters with low focal length and low diameter will have wide angle images with high DOF. Clusters with high focal length and large diameter will contain images of an object zoomed into (with shallow DOF).

e. FIG. 10 shows focal length versus aperture diameter scatter plots for two light clusters. FIG. 10(a) is the plot for the second light cluster in images shot with flash. The vertical lines testify to the hypothesis that the focal length and diameter are independent optical parameters in a camera. FIG. 10(b) is the focal length versus diameter scatter plot of photos chosen from the leftmost light cluster in FIG. 7(b). It also has the sets of vertical lines. Further, some may find it interesting to see that points are arranged in straight lines of constant slope. Each of these constant slope lines corresponds to an f-number (the ratio between focal length and aperture). It is also seen from this analysis that high focal length images (>60 mm) rarely have low apertures. This may be intuitive because with a high focal length people generally focus on a particular object, and they need shallow DOF (high aperture) for that.

The Survey and Human Induced Classes

A survey was conducted about popular image classes generally shot by amateurs. The survey was conducted among roughly thirty people who were asked to come up with common class names they would assign to the images in their personal albums. The survey revealed that class names depend on the human perception and background knowledge of an image; for example, an image of people in scenery can be classified as ‘outdoor activities’, ‘hikes/trips’, ‘people in landscape’, ‘group of people’, etc. From the feedback, the fifty-five most common class names were chosen for the study. These class labels were then assigned to two thousand images randomly selected from the ten thousand image hold out set. Since the classes are not mutually exclusive, each image was assigned to as many classes as seemed coherent to the person involved in tagging.

Comparison of Human Induced Classes and Unsupervised Clusters

With reference to FIG. 12, to evaluate the effectiveness of the optical meta layer in classification of images, the unsupervised clusters may be compared with the human induced classes. The unsupervised clusters may be called Optical Clusters, and the human induced class names, Human Classes. A mapping F was performed between these two sets, showing the correspondence between the optical clusters and human classes. Further, this mapping was defined as ‘soft’: each edge between a Human Class and an Optical Cluster has a weight. This weight signifies the strength of the belief in the correspondence. The weight was found by the following probabilistic framework. Let HumClass be a discrete random variable over the set of Human Class indices, and OptClus be a discrete random variable over the set of Optical Cluster indices. One may then want to evaluate the conditional probability P(HumClass|OptClus) to express the strength of correspondence between HumClass and OptClus. From the tagged dataset, there may be a set of images D_(i) for each of the HumClass values i. Thus,

P(HumClass|OptClus) ∝ P(OptClus|HumClass)P(HumClass) = P(D_(i) ∈ OptClus|HumClass_(i))P(HumClass_(i))
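Under this framework, the correspondence weights can be estimated from the tagged hold-out set by simple counting; a sketch follows (the pair-list input format is an assumption):

import math
from collections import Counter

def class_cluster_logliks(tagged_pairs):
    # tagged_pairs: list of (human_class, optical_cluster) pairs from the
    # hand-labeled hold-out set; an image tagged with several classes
    # contributes one pair per class.
    pair_counts = Counter(tagged_pairs)
    class_counts = Counter(c for c, _ in tagged_pairs)
    n = len(tagged_pairs)
    table = {}
    for (c, k), m in pair_counts.items():
        p_clus_given_class = m / class_counts[c]   # P(OptClus | HumClass)
        p_class = class_counts[c] / n              # P(HumClass)
        table[(c, k)] = math.log(p_clus_given_class * p_class)
    return table  # log P(HumClass | OptClus) up to a per-cluster normalizer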

Thus the results can be expressed as a table of log likelihoods between Optical Clusters and Human Classes (Table 1, which follows).

TABLE 1
Ranked human semantic concepts for each optical cluster (log likelihoods in parentheses)

OC1: Group of People (−2.1); Single Person Indoors (−2.5); Portraits of People (−3)
OC2: People At Dinner (−4.0); Views of Rooms/Offices (−4.5); Public Places Indoors (−7); Talk By Speaker (−14)
OC3: City Streets (−5.21); Vehicles/Cars (−5.3); Buildings/Architectures (−6); Public Places Outdoors (−6.3)
OC4: People In Theaters/Auditoriums (−6.1); Fireworks (−7); Bars and Restaurants (−8)
OC5: People in front of Buildings Outdoors (−1.92); Buildings/Houses (−2.5); Daily Activities Outdoors (−5); Group of People (−7.2)
OC6: Signboards (−7.2); Posters/Whiteboards (−8.1); Outdoor Decorations (−9.5)
OC7: Moonlit Scenes (−12); Illuminations at Night (−13.4); Bird's Eye at Night (−16); Stage Shows (−20)
OC8: People at Meetings (−15); Food (−16.2); Indoor Decorations (−19.2)
OC9: Lakes/Oceans (−1.37); Mountains (−1.5); Landscape/Scenery (−2); People in Scenery (−2.67)
OC10: Sunset (−3.2); Silhouette (−4.5); Illuminations at Night (−10)
OC11: Wildlife (−1.83); Sports (−3.2); Bird's Eye View (−4.6); Trees/Forests (−6)

Interpretation of Results

The hierarchical unsupervised algorithm returned multiple sets of clusters. The first set is for images without flash; there are thirty three clusters in this set. Next is a set of twenty nine clusters for images with flash. There are two clusters for images with very high focal length (>100 mm) and six clusters for images with very high exposure time (>0.5 sec). Eleven optical clusters were selected, and the most likely human induced classes for each of them are shown. Each row in the figure corresponds to an optical cluster (OC). OC1 has predominantly indoor shots with short focal length and low DOF (people posing for a photo, portraits, etc.). OC2 contains indoor images with similar light but longer focal length and larger DOF. OC3 is in a different lighting condition altogether; it is a cluster of outdoor objects like streets, cars and buildings. OC4 is of dark indoors like bars/restaurants, or fireworks in dark outdoors. OC6 is of outdoor objects which have been focused into. OC7 is dark outdoors like moonlit scenes and illuminations or stage shows, etc. OC9 and OC10 are of high focal lengths, of which the ambient light in OC9 (sceneries/landscapes) is more than in OC10 (sunsets). OC11 is of images with very high focal lengths (sports/wildlife, etc.).

Annotation of Untagged Images

To explore the robustness of the algorithm for prediction of tags on untagged images, a set of images was collected which had not been used either for building the unsupervised model or for creating the human tagged set. For each test image, the probability of each of the unsupervised clusters was found. Heuristically, the ten clusters having the largest probability were chosen. Next, for each cluster, the top five human induced classes were chosen and their probability weighted with the cluster specific probability. Mathematically, the process can be expressed as:

P(HumClass_(i)|TestImage) = Σ_(k)P(HumClass_(i)|OptClus_(k))P(OptClus_(k)|TestImage).

P(OptClus_(k)|TestImage) can be obtained using the parameters of the Gaussians, and P(HumClass_(i)|OptClus_(k)) has been generated from the tagged image set. The list may be ranked based on the conditional probability. In most cases, the probability decreases smoothly or by small steps until there is a sudden large drop. For example, FIG. 18 shows exemplary human induced classes for some test images up to the likelihood drop. The results are presented in tabular format (Table 2), where an Image column 200 is comprised of test images that were run through the probability process, and their respective annotation results can be seen in the column of Predicted Classes 300. For one test sample image, Image 1 (205), annotations such as “Outdoor Night”, “People in Restaurants”, “Theater”, “Stage Show”, “Talk By Speaker”, “Portrait at Night”, and “Public Indoor Places” were generated before a drop-off threshold in likelihood is reached.
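A sketch of this ranking procedure follows (the dictionary input formats are assumptions; the cluster and class cut-offs follow the heuristic described above):

def predict_classes(p_cls_given_clus, p_clus_given_img,
                    top_clusters=10, top_classes=5):
    # Implements P(HumClass_i | TestImage)
    #   = sum_k P(HumClass_i | OptClus_k) * P(OptClus_k | TestImage).
    # p_cls_given_clus[k]: dict {class_i: probability} for cluster k.
    # p_clus_given_img:    dict {cluster_k: probability} for the test image.
    scores = {}
    best_k = sorted(p_clus_given_img, key=p_clus_given_img.get,
                    reverse=True)[:top_clusters]          # heuristic: 10 clusters
    for k in best_k:
        top = sorted(p_cls_given_clus[k], key=p_cls_given_clus[k].get,
                     reverse=True)[:top_classes]          # top 5 classes each
        for c in top:
            scores[c] = scores.get(c, 0.0) \
                + p_cls_given_clus[k][c] * p_clus_given_img[k]
    return sorted(scores.items(), key=lambda kv: -kv[1])  # ranked annotations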

The Automatic Annotation Framework

An automatic annotation framework may be based on the relevance model approach. Referring to FIG. 13, a Bayesian network model is defined to show the interaction among content and contextual information in an image. An image (I) is treated as a bag of blocks and words generated from the same process. The blocks come from a finite block vocabulary and the words come from a finite word vocabulary. One way the blocks are generated includes dicing up each image in a training corpus into fixed sized square blocks. For each block, a number of low level features like color histogram, color moments, texture and edge histograms are computed. Assuming that a feature vector is of size (F), and there are (M) blocks in each image and (N) photos in the training corpus, a result of (MN) blocks can be obtained. Using a k-means algorithm on these (MN) blocks finds K representative blocks, so the block vocabulary size is (K). One may assume all images have been generated by selecting blocks from this vocabulary space. Hence, each image can be characterized by a discrete set of block numbers of size (M). A value for (K) may be derived heuristically, for example, 500. Separately, the responses in a sample survey of thirty people who submitted common nouns for photos they shot yielded a word vocabulary size of 50 words.
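A sketch of the block-vocabulary construction, assuming scikit-learn's KMeans and a stand-in feature extractor (the real features would be the color histogram, color moments, texture and edge histograms named above):

import numpy as np
from sklearn.cluster import KMeans

def extract_features(block):
    # Toy stand-in for the block features: mean RGB plus a coarse
    # 8-bin intensity histogram.
    hist, _ = np.histogram(block.mean(axis=2), bins=8, range=(0, 255))
    return np.concatenate([block.reshape(-1, 3).mean(axis=0), hist])

def build_block_vocabulary(images, block_size=16, k=500):
    # Dice each training image into fixed-size square blocks, featurize the
    # MN blocks, and quantize them into K representative blocks; the fitted
    # model's predict() then maps any new block to its vocabulary index.
    feats = []
    for img in images:  # img: H x W x 3 numpy array
        h, w = img.shape[:2]
        for y in range(0, h - block_size + 1, block_size):
            for x in range(0, w - block_size + 1, block_size):
                feats.append(extract_features(
                    img[y:y + block_size, x:x + block_size]))
    return KMeans(n_clusters=k).fit(np.array(feats))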

Referring to FIG. 13(a), a Bayes Net model is shown. Assuming a given training corpus of images (T) (not shown), each image (I) ∈ (T) is represented by a vector of discrete values (B) of size (M), which denotes which blocks from the vocabulary have been used to create the image. Also associated with each image is a set of words or annotations (W). Thus, (I)=(B,W). Automatic annotation can then be used to predict the (W)s associated with an untagged image based on (B). Thus, for an untagged image, the probability of a word being assigned to the image can be computed based on (B). Let (I) be the random variable over all images in the training corpus, (B) be the random variable over the block vectors and (W) be the random variable over the words. In the Bayes Net model, (B) becomes the observed variable. The conditional probability of a word given a set of blocks is computed according to the following equation:

P(w|B) = Σ_(I) P(w,I|B) = Σ_(I) P(w|I,B)P(I|B) ∝ Σ_(I) P(w|I)P(B|I)P(I)

P(w|I) and P(B|I) can be learned from the training corpus after adequate smoothing. This would be a baseline model; however, it does not consider the contextual information which is present in a digital photo. Hence, referring to FIG. 13(b), a model is proposed integrating both content and context. The contextual information is assigned to images in optical clusters (O) using an untagged image database. Whenever a new image (X) comes, it may be assigned to the cluster O_(j) having the maximum value for P(X|O_(j)). Here O is a random variable over all clusters, and it is observed. Thus the probability of a word given the pixel feature blocks and the optical context information can be computed as in the following equation:

P(w|B,O) = Σ_(I) P(w,I|B,O) = Σ_(I) P(w|I,B,O)P(I|B,O) ∝ Σ_(I) P(w|I)P(B,O|I)P(I) = Σ_(I) P(w|I)P(B|I)P(O|I)P(I)

Each (O) is represented as a Gaussian cluster whose parameters are learned using the algorithm for the clustering model.

Indoor/Outdoor Classification

To show the efficacy of the Optical Meta Layer, a two class classification problem may be solved. First, the camera parameters may be used to classify photos either as indoor shots or outdoor shots. As a baseline, the raw camera parameters (focal length, exposure time, flash, diameter, ISO value) are used. Next, the latent variable LogLightMetric is used for classification. As shown in Table 3, the accuracy improves by 3% and the F-Measure improves by 5% if the latent variable is used for classification. Also, one may see that the optical metadata are by themselves quite efficient in distinguishing between indoor and outdoor shots.

TABLE 3
Results of indoor/outdoor classification using optical metadata only

Type of Model          Mean Accuracy   Mean Absolute Error   F-Measure Indoors   F-Measure Outdoors
Raw Camera Parameters  90.5            0.13                  0.89                0.90
LogLightMetric         93.86           0.09                  0.93                0.95
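A sketch of the Table 3 comparison follows. The specification does not name the classifier used, so logistic regression stands in purely for illustration; X_raw, X_llm and y are assumed to be prepared feature matrices and indoor/outdoor labels.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compare(X_raw, X_llm, y):
    # Same classifier on both feature sets; only the features differ.
    # X_raw: focal length, exposure time, flash, diameter, ISO per photo.
    # X_llm: a single LogLightMetric column (the latent variable).
    for name, X in (("raw parameters", X_raw), ("LogLightMetric", X_llm)):
        acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
        print(f"{name}: mean accuracy {acc.mean():.3f}")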

TABLE 4
Automatic annotation on the entire vocabulary set

Type of Model             Mean Precision   Mean Recall
Only Content Data         0.46             0.31
Both Content and Context  0.56             0.35

Next, the results of automatic image annotation may be shown, first using only image features and then using both image features and optical context information. For each photo in the test set, the algorithm finds a probability distribution over the entire vocabulary. Heuristic techniques choose the top five words as the annotations for an image. If the ground truth annotations (tagged by humans) match any of the predicted annotations, there is a hit; else there is an error. Precision for a particular tag is the ratio between the number of photos correctly annotated by the system and the number of all photos automatically annotated with the tag. Recall is the number of correctly tagged photos divided by the number of photos manually annotated with that tag. Table 4 shows the mean precision and mean recall for the entire set of fifty tags in the vocabulary set. Table 5, following, shows the mean precision and mean recall values for the twenty most popular words in an example vocabulary set (which have at least 40 representative photos in the training set). In each table, the top row shows the results for the model in FIG. 13(a) and the bottom row shows the results for the model proposed in FIG. 13(b).

TABLE 5
Automatic annotation on the popular vocabulary set

Type of Model             Mean Precision   Mean Recall
Only Content Data         0.59             0.46
Both Content and Context  0.76             0.60

In FIGS. 14 and 15, automatic annotations are shown on some test photos, where “original tags” are the tags inserted manually. The set Auto Annotation1 contains the tags predicted by the baseline model of FIG. 13(a). The set Auto Annotation2 contains the tags predicted by using both content and context information per FIG. 13(b).

In an exemplary embodiment, context information can be used to narrow down the search space for image retrieval. Keyword based image retrieval can be done in one of two ways in a searchable environment. If all images in the database have been annotated with words from the vocabulary along with the corresponding probabilities, one can simply retrieve the images having the query tag with a probability higher than a cutoff. In the other approach, one can translate the query words into the image feature space. Since a joint distribution between blocks and words may already be learned, the most likely block vector can be found given the query word(s). This block vector may be used to find the most likely photos. In both cases, searches may run through the entire image database; or, using the optical metadata under another embodiment of the present invention, the retrieval engine may be guided to particular clusters.

Optical clusters previously generated (for example, those shown in FIGS. 6 and 7) may be used as representative clusters. One advantage is that these clusters can be computed with minimal computational resources and can be readily interpreted. The tagged photo set was divided into training and test sets. For each photo in the training set, the most likely Gaussian cluster was found, and all the tags of the photo were assigned to that cluster. So, after iterating through the entire training set, there is a joint distribution between the tags and the optical clusters. Then, for each word, clusters were chosen which contribute to the top P% of its probability mass. These clusters were marked as the only ones generating all the photos with the tag. For each tagged image in the test set, the most likely cluster was found. Experiments with forty different random samples of training and test images were performed. For P=70%, only 30% of the entire image database needed scanning (assuming that each cluster has equal weight in the image space). The errors on all the images and tags in the test set were computed. Table 6 shows the mean errors for some tags across all test samples. The mean average error for all tags across the forty different samples is 0.18.
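A sketch of the cluster-pruning index follows (the (tag, cluster) pair-list input is an assumption):

from collections import defaultdict

def clusters_for_tags(train_pairs, p_mass=0.70):
    # train_pairs: list of (tag, cluster) pairs, one per tag of each
    # training photo. For each tag, keep the smallest set of clusters
    # covering the top P% of its probability mass; only those clusters
    # need scanning at query time.
    counts = defaultdict(lambda: defaultdict(int))
    for tag, k in train_pairs:
        counts[tag][k] += 1
    index = {}
    for tag, by_k in counts.items():
        total = sum(by_k.values())
        kept, mass = [], 0.0
        for k, m in sorted(by_k.items(), key=lambda kv: -kv[1]):
            kept.append(k)
            mass += m / total
            if mass >= p_mass:        # stop once P% of the mass is covered
                break
        index[tag] = kept
    return index                      # query tag -> clusters to scan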

TABLE 6
Mean errors for search

Tag                  Mean Error
Buildings/Monument   0.14
Sunset               0.15
Illuminations        0.15
Beach                0.28
Indoor Group Photo   0.10
Vehicle              0.14
Indoor Artifacts     0.27

Decrease in Search Space

The inferences obtained using the Optical Meta Layer in the present invention can significantly decrease the space for image search. The proposed exemplary algorithms could be used in the top level of a hierarchical retrieval algorithm. This can improve efficiency by decreasing the search time and also remove false positives by guiding the algorithm to the relevant clusters. For instance, the pixels in a photo of a garden and in a photo of an outdoor sporting event (in a field) may be very much the same; it might be difficult to distinguish them using color histograms. But the optical parameter ‘focal length’ will guide the retrieval algorithm to the correct clusters, as the focal lengths of these two types of images are likely to vary widely. For instance, some human induced classes were chosen and their distribution over a set of optical clusters was found, as seen in FIG. 11. The horizontal axis denotes the focal length index of the Optical Clusters. The ambient light in the clusters decreases from left to right on the horizontal axis. The DOF changes in cycles on the horizontal axis. This is due to the multimodal nature of the distributions. For instance, referring to FIG. 16, the photos of city streets can be shot in different lighting conditions, but they may have the same DOF distribution; the peaks correspond to different lights but similar DOF. As another example, indoor parties/people at dinner are typically shot in low light conditions, which may explain the peaks towards the right end of the spectrum. ‘Public places indoors’ is a broad class; hence it is spread throughout the spectrum.

Next, the top optical clusters were selected which together contribute 85% of the mass for a particular human induced class. In Table 7, which follows, the ratio of the total weight of these clusters relative to the entire search space is shown. The ratio was found separately for the space of photos with flash and for photos without flash. For most of the image classes, the search space was narrowed down to 30-40% of the entire space using optical metadata alone. Most of the classes related to nature or outdoor city life (landscape, city streets, mountains, etc.) are found concentrated in the images without flash. Portraits are generally shot in bright light and not in dark nights; hence the distribution of portraits in the set of images with flash is low. Wildlife photos are shot with high focal lengths (which were modeled separately and are not shown here), which may explain their low concentrations in both image sets. Certain classes of images (people at dinner, group photos at night, and various night scenes) are more likely to be shot with flash, which explains their concentration in the set of images with flash. Some classes of photos can be fairly broad and can be shot both with and without flash. These include indoor parties/special occasions, night life, illuminations, and public places indoors (theaters, auditoriums).

TABLE 7
Search space decrement for human image concepts

Class Names                                Fraction of Image   Fraction of Image
                                           Space W/O Flash     Space With Flash
City Streets                               0.20896             0.066552
Building/Architecture                      0.244246            0.068718
Scenery/Landscape                          0.362119            0.042299
People in Scenery                          0.42766             0.340374
Oceans/Lakes                               0.362119            0.045448
Mountain                                   0.330615            0.040133
Portraits Indoors                          0.261055            0.032708
Portraits Outdoors at Day                  0.240372            0.052366
Group Photo Indoors/at Night               0.202701            0.341745
Group Photo Outdoors at Day                0.42766             0.027392
People on Streets                          0.253875            0.068718
People in front of Building/Architecture   0.364461            0.372224
Flowers/Plants                             0.235974            0.052366
Wildlife                                   0.121561            0.021597
Trees                                      0.360587            0.025947
Bird's Eye View                            0.365993            0.006289
Furniture/Appliances                       0.25317             0.027392
Pets                                       0.230677            0.047504
Indoor Daily Life                          0.246327            0.034874
Indoor Decorations                         0.365706            0.034874
Views of Rooms                             0.182711            0.027392
Illumination at Night                      0.396726            0.33838
Garden                                     0.267731            0.029787
Indoor Parties/Special Occasions           0.039203            0.302586
Public Places Indoors                      0.268222            0.034874
Sunset                                     0.336539            0.021368
Sky                                        0.220874            0.032708
Silhouette                                 0.312646            0.044176
Beach                                      0.266012            0.063015
Outdoor Decorations                        0.356489            0.032708
People at Dinner                           0.009253            0.032708
Public Places Indoors                      0.298612            0.33838
Night Scenes                               0.329343            0.475733

FIG. 17 schematically represents a series of steps involved in a method 100 for classifying digital images using optical parameters according to another embodiment of the invention, wherein step 110 may involve assembling a set of commonly associated classes. The classes may be formed generally as described hereinabove, for example, with reference to FIG. 12 and Table 1.

As a non-limiting example, a set of commonly associated classes may be formed using the results of human induced feedback, for example, a survey.

Step 120 may involve extracting and analyzing optical metadata and pixel values from an image file selected from a database of input photos. As a non-limiting example, information such as exposure time, focal length, f-numbers, and flash presence (FIG. 4) may be analyzed for optical characteristics and the results stored.

Step 130 may involve applying an algorithm to the optical metadata and pixel values to predict probable grouping (meaningful clustering) with other image files having similar measurements. For example, FIGS. 6, 7, and 9 depict clustering of images based on Gaussians of similar ambient light characteristic results.

Step 140 may involve assigning classification tags to image files to map probable class tags based on results of algorithms such as those in Step 130. With reference to FIG. 12, one exemplary mapping model depicts various cluster groups associated with a sample of human induced classes.

Step 150 may involve classifying digital images according to a set of associated classes using classification tags such as those in Step 140. With reference to FIGS. 14 and 15, digital images may be assigned multiple tags and classified into multiple classifications when image elements satisfy a probable relation to the various classifications.

Step 160 may involve retrieving images with desired classifications within a probability of being related to a query word. Referring to FIG. 13, a Bayesian model may be used to assign a word to an image using a probability equation based on content and contextual information. As described hereinabove, probability algorithms may be used to retrieve images satisfying a cutoff value. Optional Step 165 divides an image into representative blocks of pixel features and optical context information, where blocks within a probability cutoff are formed into block vectors. Optional Step 167 then clusters together images within a maximum probability value for a word matching the block vector. Optional Step 170 provides a searchable environment for an image using keywords that satisfy the probability value of Step 167.

Thus, while the optical meta layer constitutes only a few bytes of information in a standard 3-4 MB digital camera photo, these bytes contain hidden information about the content of the other layers (Pixel and Human Induced). Further, they can be easily extracted and processed without much computational cost. Although the invention has been described primarily with respect to using the optical meta layer to generate inferences among digital images, the present invention is not meant to be limited in this respect. For instance, the information from a human induced meta layer (if available) can also help boost the inferences derived from the optical meta layer. Additionally, the inferences in the derived meta layer may be obtained from the joint distribution of the optical, temporal, and human induced meta layers.

It should be understood, of course, that the foregoing relates to exemplary embodiments of the invention and that modifications may be made without departing from the spirit and scope of the invention as set forth in the following claims.

1. A method for classifying digital images, the method comprising: clustering optical parameters of the digital images into a set of meaningful clusters; associating the set of meaningful clusters to a set of associated classes used by a user; and classifying the digital images according to the set of associated classes.

2. The method for classifying digital images of claim 1, further comprising: analyzing the optical parameters for context information and content information.

3. The method for classifying digital images of claim 1, further comprising: automatically annotating each digital image with a set of text for further classification.

4. The method for classifying digital images of claim 1, wherein: the optical parameters are gathered from EXIF data attached to metadata of the digital images.

5. The method for classifying digital images of claim 1, further comprising: assembling together a set of block vocabulary; assembling together a set of word vocabulary; applying an algorithm to the digital images to predict probabilities where members of the block and word vocabulary are associated to the digital images; and associating the probabilities into block vectors for determining classification of the images into the associated classes.

6. The method for classifying digital images of claim 5, further comprising: providing a searchable environment using query tags for the digital images; and retrieving digital images searched by query tag using the block vectors.

7. The method for classifying digital images of claim 1, further comprising: providing a searchable environment using query tags for the digital images; and retrieving images having the query tags with a probability higher than a predetermined cutoff.

8. A method for organizing digital images, the method comprising: deriving optical parameters from the digital images; accessing a set of subject classes commonly used by a user and assembled into predefined parameters; determining what the user was trying to capture in the digital image by associating the derived optical parameters with the set of digital image subject classes; and organizing the digital images into classifications according to the associations determined by the derived optical parameters.

9. The method for organizing digital images of claim 8, further comprising: accessing a file of constructed tags; and automatically annotating the digital images by associating the classifications with the tags.

10. The method for organizing digital images of claim 8, further comprising: analyzing the optical parameters for context information and content information.

11. The method for organizing digital images of claim 8, further comprising: providing a searchable environment using query tags for the digital images; and retrieving images having the query tags with a probability higher than a predetermined cutoff.

12. The method for organizing digital images of claim 8, further comprising: assembling together a set of block vocabulary; assembling together a set of word vocabulary; applying an algorithm to the digital images to predict a probability where members of the block and word vocabulary are associated to the digital images; and associating the probabilities into block vectors for determining classification of the images into the associated classes.

13. The method for organizing digital images of claim 12, further comprising: providing a searchable environment using query tags for the digital images; and retrieving digital images searched by query tag using the block vectors.

14. The method for organizing digital images of claim 8, further comprising: manually assigning text tags to images according to the classifications the images are associated with.

15. A method for using optical metadata of a digital image to classify the digital image, the method comprising: analyzing the optical metadata to find clusters of digital images having similar optical concepts; comparing the clusters with human induced classes; and corresponding the human induced classes with a classification for the digital image.

16. The method of claim 15, wherein: in the step of analyzing, the clusters are derived from measurements including ambient light level data.

17. The method of claim 15, wherein: in the step of analyzing, the clusters are derived from measurements including depth of focus data.

18. The method of claim 15, further comprising: accessing a file of constructed tags; and automatically annotating the digital images by associating the classifications with the tags.

19. The method of claim 18, wherein: the step of automatic annotation includes associating the digital images by classifications according to extracted optical context data from the optical metadata.

20. The method of claim 15, further comprising: providing a searchable environment using query tags for the digital images; and retrieving images having the query tags with a probability higher than a predetermined cutoff.